We have an agent-based model of fishing. We can use it to predict the consequences of imposing a policy. So far we have used it for two different tasks: scenario evaluation and policy optimization.
By scenario evaluation I mean taking the model, inputting a policy and seeing what it does. By policy optimization I mean looking for the best parameters of a given rule (say, what is the best quota amount to impose for each species of fish if we want to maximize cash returns over 20 years).
The main weakness of this approach is that it is fundamentally open loop: the rule is set at the beginning and stays fixed throughout the run. This makes the policy brittle to shocks, since it never uses any information gathered while the model runs.
What we would like instead are closed-loop rules, where the information generated by the model can be used to adjust the policy (for example, sensing a recruitment boom, we could increase the quotas). This was our old idea of having a “policy-maker” agent within the model.
For our agent-based model there are two ways of accomplishing this: policy search and approximate dynamic programming.
Policy search is what we did when simulating adaptive taxation: the modeler comes up with an adaptive policy (in that case a threshold tax) and the optimizer then searches for the policy parameters, much as in the open-loop case.
The hard case, however, is when we don’t really know a priori what a good policy rule would look like. We might have some indicators about the fishery and some actions we can take, but no idea how to connect the two. In a very simple mathematical model we could map indicators to actions by dynamic programming, solving the Bellman equation. Agent-based models, however, aren’t amenable to that approach because the transition probabilities are hard to compute. We could run the model a large number of times to estimate the transition probabilities, but that quickly becomes impractical even with only a few indicators.
What we do instead is approximate the dynamic programming value function directly, running the model many times and iteratively improving our strategy.
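For contrast, when the transition probabilities *are* known, the Bellman equation can be solved directly by value iteration. A minimal sketch on a made-up two-state, two-action fishery (all numbers are illustrative, not taken from our model):

```python
import numpy as np

# Toy MDP with known transition probabilities: 2 states x 2 actions.
# States: 0 = depleted, 1 = healthy; actions: 0 = close, 1 = open.
# P[a][s, s'] = probability of moving from s to s' under action a.
P = np.array([
    [[0.9, 0.1],    # close: depleted biomass slowly recovers
     [0.1, 0.9]],   # close: healthy biomass stays healthy
    [[1.0, 0.0],    # open: depleted stays depleted
     [0.6, 0.4]],   # open: healthy often becomes depleted
])
# R[a][s] = immediate reward for taking action a in state s.
R = np.array([[0.0, 0.0],      # closing earns nothing
              [10.0, 100.0]])  # opening earns more when biomass is healthy
gamma = 0.95

# Value iteration: apply the Bellman update until V stops changing.
V = np.zeros(2)
for _ in range(1000):
    Q = R + gamma * np.einsum('ast,t->as', P, V)  # Q[a, s]
    V_new = Q.max(axis=0)
    if np.max(np.abs(V_new - V)) < 1e-9:
        break
    V = V_new
policy = Q.argmax(axis=0)  # best action per state
```

With these made-up numbers the solved policy closes the fishery when the stock is depleted (letting it recover) and opens it when it is healthy, which is exactly the kind of feedback rule we are after.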
For example, imagine that our fishery indicators are the biomass left and the month of the year. Our only action is whether to open the fishery that month or not. The only reward we care about is the amount of cash made by fishers. If we run the model while randomly opening and closing the season, we might produce a data set like this:
| Biomass | Month | Action | Reward | Biomass after | Month after |
|---|---|---|---|---|---|
| 100 | 1 | open | 100 | 80 | 2 |
| 80 | 2 | close | 0 | 80 | 3 |
| 80 | 3 | open | 90 | 65 | 4 |
| 65 | 4 | open | 80 | 30 | 5 |
We can predict the reward produced by an action given the current indicators by running the regression \[ R = \beta_0 + \beta_1 \text{Biomass} + \beta_2 \text{Month} \] twice: once on the observations where we picked action “open” and once where we picked “close”. Then, in principle, we could choose each month the action for which the regressions predict the highest reward.
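As a sketch of this step, here is the “open” regression fitted by least squares on the toy transitions above (the table has only one “close” observation, so that regression would be degenerate and only the open side is shown):

```python
import numpy as np

# Toy transitions from the table above: (biomass, month, action, reward).
data = [
    (100, 1, "open", 100),
    (80, 2, "close", 0),
    (80, 3, "open", 90),
    (65, 4, "open", 80),
]

def fit_reward_model(rows):
    """Least-squares fit of R = b0 + b1*Biomass + b2*Month."""
    X = np.array([[1.0, b, m] for b, m, _, _ in rows])
    y = np.array([r for _, _, _, r in rows], dtype=float)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

open_rows = [r for r in data if r[2] == "open"]
beta_open = fit_reward_model(open_rows)

# Predicted immediate reward of opening at biomass 70 in month 5:
pred = beta_open @ np.array([1.0, 70, 5])  # → 90.0
```

With enough runs we would fit `fit_reward_model` on the “close” observations as well and pick, each month, whichever action predicts the higher reward.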
The problem with this naive approach is that it is very short-term oriented. In this example the reward is always 0 when the fishery is closed, so it always looks better to keep the fishery open. Rather than predicting the immediate reward alone, we would rather predict the sum of all future rewards (discounted by \(\gamma\)) after taking an action given certain indicators: \[ R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots \] This, however, is hard to do because the rewards in the future depend on other actions we will take later.
We focus instead on the value function \(V(S)\): the sum of all rewards we would achieve if, seeing indicators \(S\), we took the “optimal” sequence of choices from then on. \[ V(S) = \max_{a_t,a_{t+1},\dots} R_t(a_t) + \gamma R_{t+1}(a_{t+1}) + \dots \] As in dynamic programming we can rewrite this recursively: \[ V(S) = \max_{a_t} R_t(a_t) + \gamma V(S_{t+1}) \] The value function is unobservable, but we can start with a guess \(\bar V\), act according to that guess, and use regressions to improve it over time. We approximate the value function as: \[ V = \beta_0 + \beta_1 \text{Biomass} + \beta_2 \text{Month} \] and run regressions where the \(y\) is always the observed reward plus our current guess, \(R + \gamma \hat V\), where \(\hat V\) is updated each time we have a new observation. This is a biased but consistent estimator of \(V\).
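One concrete way to implement this iterative scheme is fitted value iteration: regress on the bootstrapped target \(R + \gamma \hat V(S')\), refresh the target with the newly fitted coefficients, and repeat. A minimal sketch (the transitions, the feature normalization, and the discount factor are all illustrative, and this evaluates the behavior that generated the data rather than also picking greedy actions):

```python
import numpy as np

GAMMA = 0.9  # discount factor; illustrative

def features(biomass, month):
    # Normalize so no single feature dominates the regression.
    return np.array([1.0, biomass / 100.0, month / 12.0])

# Illustrative transitions (biomass, month) -> reward -> (biomass', month'),
# alternating open and closed months so the biomass can also recover.
transitions = [((100, 1), 100, (80, 2)),
               ((80, 2), 0, (85, 3)),
               ((85, 3), 90, (70, 4)),
               ((70, 4), 0, (78, 5)),
               ((78, 5), 85, (64, 6))]

Phi = np.array([features(*s) for s, _, _ in transitions])
Phi_next = np.array([features(*s2) for _, _, s2 in transitions])
rewards = np.array([r for _, r, _ in transitions], dtype=float)

# Start from the guess V_bar = 0 everywhere, then iterate:
# regress on the bootstrapped target y = R + gamma * V_hat(S').
beta = np.zeros(3)
for _ in range(300):
    y = rewards + GAMMA * (Phi_next @ beta)
    beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
```

After enough iterations `beta` settles near a fixed point where the fitted values are consistent with their own bootstrapped targets.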
This approach is called approximate dynamic programming or reinforcement learning and we are going to use it to try and discover endogenously good policies.
Imagine our usual baseline scenario where there is a bunch of fish growing logistically all over the map and 300 fishers catching it. Because the fishers are too many the biomass quickly dies off.
Imagine we wanted to maximize 20-year earnings. We can go the usual route and look for the quota amount that maximizes the 20-year cash flow. That’s classic open-loop policy optimization, which we can feed to our Bayesian optimizer. This generates a Bayesian posterior like this:
The maximum of the posterior is 712,038 units of fish a year. If we run a simulation with this policy we get the following result:
Basically the quota is just enough to consume the biomass over 20 years; this maximizes our only objective which is fishers’ cashflow.
Now imagine the same scenario except that we aren’t able to impose quotas (for whatever reason). Our only action, each month, is whether to open or close the fishery for the next 30 days.
Imagine we are looking at two indicators: the biomass left and the number of months remaining before the 20-year mark.
Intuitively there must be a way to look at biomass and months left and decide whether to open the fishery. When biomass is abundant we probably want to open the fishery; likewise, when only a few months are left before the 20-year mark we want to open the fishery and catch as much as possible before the deadline.
We want, however, the computer to learn this mapping on its own. We set up a value approximation as follows: \[ \left\{\begin{matrix} V^{\text{open}} = \beta_0 + \beta_1 \text{Biomass} + \beta_2 \text{Months} & \text{open} \\ V^{\text{close}} = \beta_3 + \beta_4 \text{Biomass} + \beta_5 \text{Months} & \text{closed} \end{matrix}\right. \] where each month we compute \(V^{\text{open}}\) and \(V^{\text{close}}\) and allow fishing that month whenever \(V^{\text{open}}> V^{\text{close}}\). We need to discover the various \(\beta\) by running the model many times, observing transitions, and running regressions again and again.
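Once trained, the decision rule is just a comparison of the two fitted values. A sketch (the coefficients and the normalization constants below are placeholders, not trained values):

```python
import numpy as np

# Placeholder coefficients; in practice these come out of the training loop.
beta_open = np.array([10.0, 50.0, -2.0])
beta_close = np.array([40.0, 30.0, 1.0])

def decide(biomass, months_left):
    """Open the fishery this month iff V_open > V_close."""
    # Normalize by assumed scales: biomass ~100, horizon 240 months (20 years).
    phi = np.array([1.0, biomass / 100.0, months_left / 240.0])
    v_open = beta_open @ phi    # V^open
    v_close = beta_close @ phi  # V^close
    return "open" if v_open > v_close else "close"
```

The same comparison runs every simulated month, so the season length emerges from the data instead of being fixed in advance.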
In practice, however, it turns out that more complicated approximations work better, so for this example we used a Fourier basis. The procedure is the same except that the regressions are over a sum of sinusoids rather than over the raw features.
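For reference, a Fourier basis of order \(n\) over \(d\) state variables scaled to \([0,1]\) uses one feature \(\cos(\pi\, c \cdot s)\) per integer coefficient vector \(c \in \{0,\dots,n\}^d\). A sketch (the order and the variable bounds here are illustrative):

```python
import itertools
import numpy as np

ORDER = 3  # basis order; illustrative

def fourier_features(state, lows, highs, order=ORDER):
    """Fourier cosine basis: one feature cos(pi * c . s) for every integer
    coefficient vector c in {0..order}^d, with the state scaled to [0, 1]."""
    s = (np.asarray(state, dtype=float) - lows) / (np.asarray(highs) - lows)
    coeffs = np.array(list(itertools.product(range(order + 1), repeat=len(s))))
    return np.cos(np.pi * coeffs @ s)

# Example: biomass assumed in [0, 1e6], months left in [0, 240].
phi = fourier_features([5e5, 120],
                       lows=np.array([0.0, 0.0]),
                       highs=np.array([1e6, 240.0]))
```

The regression then fits one \(\beta\) per cosine feature (here \((3+1)^2 = 16\) of them per action) instead of one per raw indicator.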
We run the model multiple times, improving our \(\beta\) with each observation and thereby our approximate value function, which in turn leads to better actions. After observing 150 episodes, we use the resulting approximate value function on a full run to test its effectiveness. The following figure shows a sample run (using the same random seed as in the Bayesian optimizer case):
The learned policy keeps the fishery mostly open for the first few years until the biomass is consumed; this is followed by a period of about 10 years where the seasons are extremely short, followed again by a period where the fishery is left mostly open. It’s an irregular boom-bust cycle, but one that in fact performs slightly better than the fixed quota system in terms of 20-year cash flow.
Knowing the biomass, however, is kind of cheating: it is an almost perfect indicator of the state of the fishery (its only weakness is that it ignores spatial distribution). A more interesting problem is how we would open and close the fishery if we could only look at human indicators: for example, the cash made by fishers and the distance they travel to fish.
We can run the same procedure and train the policy on these different indicators. The result is a similar boom-bust cycle over the 20 years, except more pronounced, with the biomass recovering above its initial value towards the middle of the run:
So there we have it: we designed a feedback control loop to maximize profits over our complicated agent-based model.
For this particular example the trained controllers earn more than quotas, although not by much.
| Method | Reward |
|---|---|
| Quota | 412056.27 |
| Biomass controller | 458028.11 |
| Cash and distance controller | 448624.45 |
This somewhat hides the ugliness of the procedure: so far it is extremely brittle with respect to its learning parameters, its basis functions, and the episodes we train it against, and instead of simply converging to a good policy it has a tendency to cycle between good and bad control.
This is partially because I am very new to this technique, but I think we should accept that reinforcement learning is hard and takes time and a lot of effort to get right.
What if we take the rules we computed and stretch them over a much longer 80-year period? Which one maximizes cash flow then?
Surprisingly, even though the reinforcement learning agent was trained to maximize profits over 20 years, and one of its two indicators is “months left before the end”, it adapts quite nicely to the longer time frame:
We can also train a biomass controller that doesn’t use months left as an indicator and feed it training episodes of different lengths, which generates control dynamics such as this:
This looks better, even though it tends to make about the same amount of money overall.